The CAP theorem’s impact on modern distributed
database system design is more
limited than is often perceived. Another 
tradeoff—between consistency and latency 
—has had a more direct influence on sev- 
eral well-known DDBSs. A proposed new 
formulation, PACELC, unifies this tradeoff 
with CAP. 


Although research on distributed database systems
began decades ago, it was not until recently
that industry began to make extensive use of 
DDBSs. There are two primary drivers for this 
trend. First, modern applications require increased data 
and transactional throughput, which has led to a desire 
for elastically scalable database systems. Second, the 
increased globalization and pace of business has led to 
the requirement to place data near clients who are spread 
across the world. Examples of DDBSs built in the past 10 
years that attempt to achieve high scalability or worldwide
accessibility (or both) include SimpleDB/Dynamo/DynamoDB,¹
Cassandra,² Voldemort (http://project-voldemort.com),
Sherpa/PNUTS,³ Riak (http://wiki.basho.com),
HBase/BigTable,⁴ MongoDB (www.mongodb.org),
VoltDB/H-Store,⁵ and Megastore.⁶
DDBSs are complex, and building them is difficult. 
Therefore, any tool that helps designers understand the 
tradeoffs involved in creating a DDBS is beneficial. The
CAP theorem, in particular, has been extremely useful in
helping designers to reason through a proposed system’s
capabilities and in exposing the exaggerated marketing
hype of many commercial DDBSs. However, since its
initial formal proof,⁷ CAP has become increasingly
misunderstood, potentially causing significant harm. In
particular, many designers incorrectly conclude that the
theorem restricts what a DDBS can do during normal system
operation. In reality, CAP only posits limitations in the
face of certain types of failures, and does not constrain
any system capabilities during normal operation.
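Nonetheless, the fundamental tradeoffs that limit DDBS
capabilities during normal operation have shaped the design
of well-known systems. In fact, one particular tradeoff,
between consistency and latency, arguably has been more
influential on DDBS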
design than the CAP tradeoffs. Both sets of tradeoffs are 
important; unifying CAP and the consistency/latency trade- 
off into a single formulation—PACELC—can accordingly 
lead to a deeper understanding of modern DDBS design. 


CAP IS FOR FAILURES 
CAP basically states that in building a DDBS, designers
can choose two of three desirable properties: consistency
(C), availability (A), and partition tolerance (P). Therefore,
only CA systems (consistent and highly available, but not 
partition-tolerant), CP systems (consistent and partition- 
tolerant, but not highly available), and AP systems (highly 
available and partition-tolerant, but not consistent) are 
possible. 


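Many modern DDBSs, including SimpleDB/Dynamo,
Cassandra, Voldemort, Sherpa/PNUTS, and Riak, do not
by default guarantee consistency as defined by CAP.
(Although the consistency of some of these systems became
adjustable after the initial versions were released, the focus
here is on their original design.) In their proof of CAP, Seth
Gilbert and Nancy Lynch⁷ used the definition of
atomic/linearizable consistency: “There must exist a total
order on all operations such that each operation looks as if
it were completed at a single instant. This is equivalent to
requiring requests of the distributed shared memory to act
as if they were executing on a single node, responding to
operations one at a time.”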


Given that early DDBS research focused on consistent 
systems, it is natural to assume that CAP was a major influ- 
ence on modern system architects, who, during the period 
after the theorem was proved, built an increasing number 


of systems implementing reduced consistency models. 
The reasoning behind this assumption is that, because any 
DDBS must be tolerant of network partitions, according to 
CAP, the system must choose between high availability and 
consistency. For mission-critical applications in which high 
availability is extremely important, it has no choice but to 
sacrifice consistency. 

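However, this logic is flawed and not consistent with
what CAP actually says. It is not merely the partition
tolerance that necessitates a tradeoff between consistency
and availability; rather, it is the combination of partition
tolerance and the existence of a network partition itself.
The theorem simply states that a network partition forces
the system to choose between reducing availability and
reducing consistency. The probability of a network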
partition is highly dependent on the various details of the 
system implementation: Is it distributed over a wide area 
network (WAN), or just a local cluster? What is the quality 
of the hardware? What processes are in place to ensure 
that changes to network configuration parameters are 
performed carefully? What is the level of redundancy? 
Nonetheless, in general, network partitions are somewhat
rare.


As CAP imposes no system restrictions in the baseline
case, it is wrong to assume that DDBSs that reduce
consistency in the absence of any partitions are doing so
due to CAP-based decision-making. In fact, CAP allows
the system to make the complete set of ACID (atomicity,
consistency, isolation, and durability) guarantees alongside
high availability when there are no partitions. Therefore,
the theorem does not completely justify the default
configuration of DDBSs that reduce consistency (and usually
several other ACID guarantees).


CONSISTENCY/LATENCY TRADEOFF 

To understand modern DDBS design, it is important 
to realize the context in which these systems were built. 
Amazon originally designed Dynamo to serve data to the 
core services in its e-commerce platform (for example, the 
shopping cart). Facebook constructed Cassandra to power 
its Inbox Search feature. LinkedIn created Voldemort to 
handle online updates from various write-intensive fea- 
tures on its website. Yahoo built PNUTS to store user data 
that can be read or written to on every webpage view, to 
store listings data for Yahoo’s shopping pages, and to store 
data to serve its social networking applications. Use cases 
similar to Amazon’s motivated Riak. 

In each case, the system typically serves data for
webpages constructed on the fly and shipped to an active
website user, and receives online updates. Studies indicate
that latency is a critical factor in online interactions:
an increase as small as 100 ms can dramatically reduce
the probability that a customer will continue to interact or
return in the future.⁹
Unfortunately, there is a fundamental tradeoff between
consistency, availability, and latency. (Note that availability
and latency are arguably the same thing: an unavailable
system essentially provides extremely high latency. For
purposes of this discussion, I consider systems with latencies
larger than a typical request timeout, such as a few seconds, 
as unavailable, and latencies smaller than a request timeout, 
but still approaching hundreds of milliseconds, as “high 
latency.” However, I will eventually drop this distinction and 
allow the low-latency requirement to subsume both cases. 


Therefore, the tradeoff is really just between consistency
and latency, as this section’s title suggests.)
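
The working distinction drawn in the parenthetical above can be written down as a small classifier. This is only an illustrative sketch: the two thresholds are hypothetical values picked to match the phrases "a few seconds" and "hundreds of milliseconds", and the names `Outcome`, `REQUEST_TIMEOUT_MS`, `HIGH_LATENCY_MS`, and `classify` are not from the article.

```python
from enum import Enum


class Outcome(Enum):
    LOW_LATENCY = "low latency"
    HIGH_LATENCY = "high latency"
    UNAVAILABLE = "unavailable"


# Hypothetical thresholds, chosen only to mirror the wording in the text.
REQUEST_TIMEOUT_MS = 3000  # "a typical request timeout, such as a few seconds"
HIGH_LATENCY_MS = 100      # "approaching hundreds of milliseconds"


def classify(latency_ms: float) -> Outcome:
    """Map an observed response latency to the three cases in the text."""
    if latency_ms > REQUEST_TIMEOUT_MS:
        # Slower than the timeout: the client sees this as unavailability.
        return Outcome.UNAVAILABLE
    if latency_ms >= HIGH_LATENCY_MS:
        return Outcome.HIGH_LATENCY
    return Outcome.LOW_LATENCY
```

Seen this way, unavailability is the far end of the latency scale, which is why the text can fold both cases into a single low-latency requirement.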


This tradeoff exists even when there are no network
partitions, and thus is completely separate from the
tradeoffs CAP describes. Nonetheless, it is a critical factor in the
design of the above-mentioned systems. (It is irrelevant to 
this discussion whether or not a single machine failure is 
treated like a special type of network partition.) 

The reason for the tradeoff is that a high availability 
requirement implies that the system must replicate data. 
If the system runs for long enough, at least one compo- 
nent in the system will eventually fail. When this failure 
occurs, all data that component controlled will become 
unavailable unless the system replicated another version 
of the data prior to the failure. Therefore, the possibility
of failure, even in the absence of the failure itself, implies
that the availability requirement demands some degree of
data replication during normal system operation. (Note the
important difference between this tradeoff and the CAP
tradeoffs: while the occurrence of a failure causes the CAP
tradeoffs, the failure possibility itself results in this tradeoff.)

To achieve the highest possible levels of availability, a 
DDBS must replicate data over a WAN to protect against 
the failure of an entire datacenter due, for example, to a 
hurricane, terrorist attack, or, as in the famous April 2011 
Amazon EC2 cloud outage, a single network configuration 
error. The five reduced-consistency systems mentioned 
above are designed for extremely high availability and 
usually for replication over a WAN. 


DATA REPLICATION 
As soon as a DDBS replicates data, a tradeoff between
consistency and latency arises. This occurs because there
are only three alternatives for implementing data
replication: the system sends data updates to all replicas at
the same time, to an agreed-upon master node first, or to a
single (arbitrary) node first. The system can implement
each of these cases in various ways; however, each
implementation comes with a consistency/latency tradeoff.


(1) Data updates sent to all replicas at the 
same time 


If updates do not first pass through a preprocessing layer
or some other agreement protocol, replica divergence, a
clear lack of consistency, could ensue (assuming
multiple updates to the system are submitted concurrently,
for example, from different clients), as each replica might
choose a different order in which to apply the updates.
(Even if all updates are commutative, such that each
replica will eventually become consistent despite the fact
that the replicas could possibly apply updates in different
orders, Gilbert and Lynch’s strict definition of
consistency⁷ still does not hold. However, generalized
Paxos¹⁰ [...])

On the other hand, if updates first pass through a pre- 
processing layer or all nodes involved in the write use an 
agreement protocol to decide on the order of operations, 
then it is possible to ensure that all replicas will agree on 
the order in which to process the updates. However, this
leads to several sources of increased latency. In the case of 
the agreement protocol, the protocol itself is the additional 
source of latency. 

In the case of the preprocessor, there are two sources 
of latency. First, routing updates through an additional 
system component (the preprocessor) increases latency. 
Second, the preprocessor consists of either multiple 
machines or a single machine. In the former case, an agree- 
ment protocol to decide on operation ordering is needed 
across the machines. In the latter case, the system forces 
all updates, no matter where they are initiated—potentially 
anywhere in the world—to route all the way to the single 
preprocessor first, even if another data replica is nearer to
the update initiation location. 
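
A toy illustration of option (1) may help. The sketch below assumes two hypothetical, non-commutative updates (`add_ten` and `double`) arriving at two replicas in different orders, and then shows the same updates applied after an ordering step has tagged them with sequence numbers. It models neither a real agreement protocol nor the latency such a protocol adds; all names are illustrative.

```python
def apply_updates(initial, updates):
    """Apply a list of update functions to a value, in list order."""
    value = initial
    for update in updates:
        value = update(value)
    return value


# Two hypothetical, non-commutative updates to the same data item.
def add_ten(balance):
    return balance + 10


def double(balance):
    return balance * 2


# Without agreement, each replica applies updates in its own arrival order.
replica_a = apply_updates(100, [add_ten, double])  # (100 + 10) * 2
replica_b = apply_updates(100, [double, add_ten])  # 100 * 2 + 10


def apply_in_agreed_order(initial, tagged_updates):
    """Sort by sequence number before applying.

    The sequence numbers stand in for the output of a preprocessor or
    agreement protocol; producing them is where the extra latency arises.
    """
    ordered = [update for _, update in
               sorted(tagged_updates, key=lambda pair: pair[0])]
    return apply_updates(initial, ordered)


# The same two updates, arriving in different orders but carrying tags.
arrival_at_a = [(1, add_ten), (2, double)]
arrival_at_b = [(2, double), (1, add_ten)]
```

The untagged replicas end with different values, while both tagged arrival orders produce the same result.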


(2) Data updates sent to an agreed-upon 
location first 


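I will refer to this agreed-upon location as a “master
node” (different data items can have different master
nodes). Update requests go to the master node, which
resolves and performs them and then replicates them to the
replica locations. There are three replication options: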


a. The replication is synchronous: the master node 
waits until all updates have made it to the replica(s).
This ensures that the replicas remain consistent, but 


synchronous actions across independent entities, 
especially over a WAN, increase latency due to the re- 
quirement to pass messages between these entities and 
the fact that latency is limited by the slowest entity. 

b. The replication is asynchronous: the system treats 
the update as if it were completed before it has been 
replicated. Typically, the update has at least made it to 
stable storage somewhere before the update’s initiator 
learns that it has completed (in case the master node 
fails), but there are no guarantees that the system has 
propagated the update. The consistency/latency trade- 
off in this case depends on how the system deals with 
reads: 

i. If the system routes all reads to the master node
and serves them from there, then there is no reduc- 
tion in consistency. However, there are two latency 
problems with this approach: 


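1. Even if there is a replica close to the
read-request initiator, the system must still
route the request to the master node, which
potentially could be physically much farther
away.

2. If the master node is overloaded with other
requests or has failed, there is no option to
serve the read from a different node. Rather,
the request must wait for the master node to
become free or recover. In other words, the
lack of load balancing options increases
latency potential.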
ii. If the system can serve reads from any node,
read latency is much better, but this can also
result in inconsistent reads of the same data item,
as different locations have different versions
of a data item while the system is still propagating
updates, and it could send a read to any of
these locations. Although the level of reduced
consistency can be bounded by keeping track
of update sequence numbers and using them to
implement sequential/timeline consistency or
read-your-writes consistency, these are
nonetheless reduced consistency options.
Furthermore, write latency can be high if the master
node for a write operation is geographically
distant from the write requester.


c. A combination of (a) and (b) is possible: the system 
sends updates to some subset of replicas synchro- 
nously, and the rest asynchronously. The consistency/ 


latency tradeoff in this case again is determined by 
how the system deals with reads: 

i. If it routes reads to at least one node that has 
been synchronously updated—for example, 
when R + W > N in a quorum protocol, where R
is the number of nodes involved in a synchro- 
nous read, W is the number of nodes involved 
in a synchronous write, and N is the number of 
replicas—then consistency can be preserved. 
However, the latency problems of (a), (b)(i)(1), 
and (b)(i)(2) are all present, though to somewhat 
lesser degrees, as the number of nodes involved 
in the synchronization is smaller, and more than 
one node can potentially serve read requests. 

ii. If it is possible for the system to serve reads from 
nodes that have not been synchronously 
updated, for example, when R + W ≤ N, then
inconsistent reads are possible, as in (b)(ii). 

Technically, simply using a quorum protocol is not
sufficient to guarantee consistency to the level defined
by Gilbert and Lynch. However, the protocol additions
needed to ensure complete consistency¹¹ are not
relevant here. Even without these additions, latency is
already inherent in the quorum protocol.
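
The quorum condition in (2)(c) can be checked by brute force on small clusters. The sketch below enumerates every possible read set of size R and write set of size W over N replicas and confirms that they always share a node exactly when R + W > N. The function names are illustrative. As the text notes, overlap alone does not deliver consistency at the level Gilbert and Lynch define; this only shows when a read is guaranteed to touch at least one synchronously updated node.

```python
from itertools import combinations


def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """True if every size-r read set shares a node with every size-w write set."""
    nodes = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(nodes, r)
        for write_set in combinations(nodes, w)
    )


def overlap_guaranteed(n: int, r: int, w: int) -> bool:
    """The arithmetic rule from the text: R + W > N."""
    return r + w > n


def rule_matches_brute_force(max_n: int = 5) -> bool:
    """Compare the rule with exhaustive enumeration for all small settings."""
    return all(
        quorums_always_overlap(n, r, w) == overlap_guaranteed(n, r, w)
        for n in range(1, max_n + 1)
        for r in range(1, n + 1)
        for w in range(1, n + 1)
    )
```

For example, with N = 3, the setting R = 2, W = 2 always overlaps, while R = 1, W = 2 admits a read served entirely by the one replica that missed the synchronous write.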


(3) Data updates sent to an arbitrary location 
first 

The system performs updates at that location, and then
propagates them to the other replicas. The difference
between this case and (2) is that the location to which the
system sends updates for a particular data item is not always
the same. For example, two different updates for a
particular data item can be initiated at two different locations
simultaneously.

The consistency/latency tradeoff again depends on two 
options: 


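a. If replication is synchronous, then the latency
problems of (2)(a) are present. Additionally, the system can
incur extra latency to detect and resolve cases of
simultaneous updates to the same data item initiated at
two different locations.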


b. If replication is asynchronous, then consistency prob- 
lems similar to (1) and (2)(b) are present. 
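
The latency side of the replication options above can be made concrete with a deliberately simple model. It assumes that the node receiving the write contacts all replicas in parallel and acknowledges the write once the fastest `sync_count` replicas have replied. Local processing time and the client's distance to that node are ignored. The function name and the round-trip figures are hypothetical.

```python
def write_latency_ms(replica_rtts_ms, sync_count):
    """Replication wait seen by a writer, under the assumptions above.

    replica_rtts_ms: round-trip times from the receiving node to each replica.
    sync_count: number of replica acknowledgments awaited before the write
        is reported complete:
          0                     -> asynchronous, as in (2)(b)
          len(replica_rtts_ms)  -> fully synchronous, as in (2)(a)
          anything in between   -> partial synchrony, as in (2)(c)
    """
    if not 0 <= sync_count <= len(replica_rtts_ms):
        raise ValueError("sync_count must be between 0 and the replica count")
    if sync_count == 0:
        return 0.0
    # Waiting for k parallel acks costs the k-th smallest round-trip time.
    return float(sorted(replica_rtts_ms)[sync_count - 1])


# Hypothetical cluster: two replicas in the local datacenter, two across a WAN.
rtts = [2, 5, 80, 150]
```

With these figures, fully synchronous replication is bounded by the slowest replica (150 ms), waiting for two acknowledgments costs only the local round trips (5 ms), and asynchronous replication adds no wait at all, at the consistency cost described in (2)(b).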


TRADEOFF EXAMPLES 

No matter how a DDBS replicates data, clearly it must 
trade off consistency and latency. For carefully controlled 
replication across short distances, reasonable options 
such as (2)(a) exist because network communication
latency is small in local datacenters; however, for
replication over a WAN, there is no way around the
consistency/latency tradeoff.

To more fully understand the tradeoff, it is helpful to
consider how four DDBSs designed for extremely high 
availability—Dynamo, Cassandra, PNUTS, and Riak— 
replicate data. As these systems were designed for low- 
latency interactions with active Web clients, each one 
sacrifices consistency for improved latency. 

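Dynamo, Cassandra, and Riak use a combination of
(2)(c) and (3). In particular, the system sends updates to
the same node and then propagates them synchronously to
W other nodes, that is, case (2)(c). The system sends
reads synchronously to R nodes, with R + W typically
being set to a number less than or equal to N, leading to
the possibility of inconsistent reads. However, the system
does not always send updates to the same node for a
particular data item; for example, this can happen in
[...]. This leads to the situation described in (3).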


A recent study by Jun Rao, Eugene Shekita, and Sandeep
Tata¹² provides further evidence of the consistency/latency
tradeoff in these systems’ baseline implementation. The 
researchers experimentally evaluated two options in Cas- 
sandra’s consistency/latency tradeoff. The first option, 
“weak reads,” allows the system to service reads from any 
replica, even if that replica has not received all outstand- 
ing updates for a data item. The second option, “quorum 
reads,” requires the system to explicitly check for incon- 
sistency across multiple replicas before reading data. The 
second option clearly increases consistency at the cost of
additional latency relative to the first option. The differ- 
ence in latency between these two options can be a factor 
of four or more. 

Another study by Hiroshi Wada and colleagues¹³ seems
to contradict this result. These researchers found that 
requesting a consistent read in SimpleDB does not signifi- 
cantly increase latency relative to the default (potentially 
inconsistent) read option. However, the researchers per- 
formed these experiments in a single Amazon region (US 
West), and they speculate that SimpleDB uses master-slave 
replication, which is possible to implement with a modest 
latency cost if the replication occurs over a short dis- 
tance. In particular, Wada and colleagues concluded that 
SimpleDB forces all consistent reads to go to the master in 
charge of writing the same data. As long as the read request 
comes from a location that is physically close to the master, 
and as long as the master is not overloaded, then the ad- 
ditional latency of the consistent read is not visible (both 
these conditions were true in their experiments). 

If SimpleDB had replicated data across Amazon regions, 
and the read request came from a different region than the 
master’s location, the latency cost of the consistent read 
would have been more apparent. Even without replication 
across regions (SimpleDB does not currently support rep- 
lication across regions), official Amazon documentation 
warns users of increased latency and reduced throughput 
for consistent reads. 

All four DDBSs allow users to change the default pa- 
rameters to exchange increased consistency for worse 
latency, for example, by making R + W more than N in
quorum-type systems. Nonetheless, the consistency/la- 
tency tradeoff occurs during normal system operation, 
even in the absence of network partitions. This tradeoff is 
magnified if there is data replication over a WAN. The obvi- 
ous conclusion is that reduced consistency is attributable 
to runtime latency, not CAP. 

PNUTS offers the clearest evidence that CAP is not a 
major reason for reduced consistency levels in these sys- 
tems. In PNUTS, a master node owns each data item. The 
system routes updates to that item to the master node, and 
then propagates these updates asynchronously to repli- 
cas over a WAN. PNUTS can serve reads from any replica, 
which puts the system into category (2)(b)(ii): it reduces 
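consistency to achieve better latency. However, in the case
of a network partition in which the master node becomes
unavailable inside a minority partition, the system by
default makes the data item unavailable for updates. In other
words, the PNUTS default configuration is actually CP: in
the case of a partition, the system chooses consistency over
availability.

Therefore, the choice to reduce consistency in the
baseline case is more obviously attributable to the continuous
consistency/latency tradeoff than to the consistency/availability
tradeoff in CAP that only occurs upon a network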
partition. Of course, PNUTS’s lack of consistency in the 
baseline case is also helpful in the network partition case, 
as data mastered in an unavailable partition is still acces- 
sible for reads. 

CAP arguably has more influence on the other three
systems. Dynamo, Cassandra, and Riak switch more fully
to data replication option (3) in the event of a network
partition and deal with the resulting consistency problems
using reconciliation code. It is therefore reasonable
to assume that these systems were designed with
the possibility of a network partition in mind. Because
these are AP systems, the reconciliation code and
ability to switch to (3) were built into the code from the
beginning. However, once that code was there, it is con- 
venient to reuse some of that consistency flexibility to 
choose a point in the baseline consistency/latency trade- 
off as well. This argument is more logical than claims 
that these systems’ designers chose to reduce consistency 
entirely due to CAP (ignoring the latency factor). 

In conclusion, CAP is only one of the two major reasons 
that modern DDBSs reduce consistency. Ignoring the con- 
sistency/latency tradeoff of replicated systems is a major 
oversight, as it is present at all times during system opera- 
tion, whereas CAP is only relevant in the arguably rare case 
of a network partition. In fact, the former tradeoff could be 
more influential because it has a more direct effect on the 
systems’ baseline operations. 


PACELC 


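A more complete portrayal of the space of potential
consistency tradeoffs for DDBSs can be achieved by
rewriting CAP as PACELC (pronounced “pass-elk”): if there
is a partition (P), how does the system trade off availability
and consistency (A and C); else (E), when the system is
running normally in the absence of partitions, how does the
system trade off latency (L) and consistency (C)?

Note that the latency/consistency tradeoff (ELC) only
applies to systems that replicate data. Otherwise, the
system suffers from availability issues upon any type of
failure or overloaded node. Because such issues are just
instances of extreme latency, the latency part of the ELC
tradeoff can incorporate the choice of whether or not to
replicate data.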
The default versions of Dynamo, Cassandra, and Riak
are PA/EL systems: if a partition occurs, they give up
consistency for availability, and under normal operation
they give up consistency for lower latency. Giving up both Cs
in PACELC makes the design simpler; once a system is
configured to handle inconsistencies, it makes sense to give
up consistency for both availability and lower latency.
However, these systems have user-adjustable settings to
alter the ELC tradeoff. For example, by increasing R + W,
they gain more consistency at the expense of latency
(although they cannot achieve full consistency as defined by
Gilbert and Lynch, even if R + W > N).


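Fully ACID systems such as VoltDB/H-Store and Megastore
are PC/EC: they refuse to give up consistency, and
will pay the availability and latency costs to achieve it.
BigTable and related systems such as HBase are also
PC/EC.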

MongoDB can be classified as a PA/EC system. In the
baseline case, the system guarantees reads and writes to
be consistent. However, MongoDB uses data replication
option (2), and if the master node fails or is partitioned
from the rest of the system, it stores all writes that have
been sent to the master node but not yet replicated in a
local rollback directory. Meanwhile, the rest of the system
elects a new master to remain available for reads and
writes. Therefore, the state of the old master and the state of
the new master become inconsistent until the system repairs
the failure and uses the rollback directory to reconcile the
states. (Technically, when a partition occurs, MongoDB is not
available according to the CAP definition of availability,
as the minority partition is not available. However, in the
context of PACELC, because a partition causes more
consistency issues than availability issues, MongoDB can be
classified as a PA/EC system.)


PNUTS is a PC/EL system. In normal operation, it
gives up consistency for latency; however, if a partition
occurs, it trades availability for consistency. This is
admittedly somewhat confusing: according to PACELC, PNUTS
appears to get more consistent upon a network partition.
However, PC/EL should not be interpreted in this way. PC
does not indicate that the system is fully consistent; rather,
it indicates that the system does not reduce consistency
beyond the baseline consistency level when a network
partition occurs. Instead, it reduces availability.
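
The PACELC labels discussed in this section can be collected into a small lookup structure. The sketch below is only a reading aid: the `Pacelc` type and the `systems_with` helper are illustrative names, and the entries record the default configurations as this section classifies them, not the tunable settings the systems also offer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pacelc:
    on_partition: str  # "A" (availability) or "C" (consistency)
    otherwise: str     # "L" (latency) or "C" (consistency)

    def __str__(self) -> str:
        return f"P{self.on_partition}/E{self.otherwise}"


# Default-configuration classifications as given in this section.
CLASSIFICATION = {
    "Dynamo": Pacelc("A", "L"),
    "Cassandra": Pacelc("A", "L"),
    "Riak": Pacelc("A", "L"),
    "VoltDB/H-Store": Pacelc("C", "C"),
    "Megastore": Pacelc("C", "C"),
    "BigTable": Pacelc("C", "C"),
    "HBase": Pacelc("C", "C"),
    "MongoDB": Pacelc("A", "C"),
    "PNUTS": Pacelc("C", "L"),
}


def systems_with(label: str) -> list:
    """Return the names of systems whose classification prints as `label`."""
    return sorted(name for name, cls in CLASSIFICATION.items()
                  if str(cls) == label)
```

Grouping by label makes the four quadrants visible at a glance, for instance `systems_with("PA/EL")` collects Cassandra, Dynamo, and Riak.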


The tradeoffs involved in building distributed database
systems are complex, and neither CAP nor
PACELC can explain them all. Nonetheless, incorporating
the consistency/latency tradeoff into modern DDBS
design considerations is important enough to warrant
bringing the tradeoff closer to the forefront of
architectural discussions.